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10 Minutes to pandas 


This is a short introduction to pandas, geared mainly for new users. You can see more complex 
recipes in the Cookbook 


Customarily, we import as follows: 


In [1]: import pandas as pd 


In [2]: import numpy as np 


In [3]: import matplotlib.pyplot as plt 


Object Creation 


See the Data Structure Intro section 


Creating a Series by passing a list of values, letting pandas create a default integer index: 


In [4]: s = pd.Series([1,3,5,np.nan,6,8]) 


In [5]: s 
Out[5]: 
0 1 
1 3 
2 5 
3 NaN 
4 6 
5 8 


dtype: float64 


Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns: 


In [6]: dates - pd.date range('20130101', periods-6) 


In [7]: dates 
Out[7]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', 
'2013-01-05', '2013-01-06'], 
dtype='datetime64[ns]', freg-'D') 


In [8]: df = pd.DataFrame(np.random.randn(6,4), index-dates, columns-list( ABCD') 


In [9]: df 
Out[9]: 

A B C D 
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 
2013-01-02 1.212112 -0.173215 0.119209 -1.044236 
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2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 
2013-01-04 0.721555 -0.706771 -1.039575 0.271860 
2013-01-05 -0.424972 0.567020 0.276232 -1.087401 
2013-01-06 -0.673690 0.113648 -1.478427 0.524988 


Creating a DataFrame by passing a dict of objects that can be converted to series-like. 


In [10]: df2 = pd.DataFrame({ 'A' ery 
aa “Be 
——_ Cc 
AET 'D' 
a “Er 
swari 'F' "OO 
In [11]: df2 
Out[11]: 
A B C D E F 
0 1 2013-01-02 1 3 test foo 
1 1 2013-01-02 1 3 train foo 
2 1 2013-01-02 1 3 test foo 
3 1 2013-01-02 1 3 train foo 


pd.Timestamp('20130102'), 
pd.Series(1,index=list(range(4)),dtype='float 
np.array([3] * 4,dtype='int32'), 
pd.Categorical(["test","train","test", "train" 


}) 


Having specific dtypes 


In [12]: df2.dtypes 
Outj L2]s 


B 
C 
D 
E 
F 


float64 
datetime64[ns] 
float32 

int32 

category 
object 


dtype: object 


If you’re using IPython, tab completion for column names (as well as public attributes) is 
automatically enabled. Here’s a subset of the attributes that will be completed: 


In [13]: 
df2. 


df2. 
df2. 
df2. 
df2. 
df2. 
df2. 
df2. 
df2. 
df2. 
df2. 
df2. 
df2. 
df2. 


df2.<TAB> 


add 

add prefix 
add suffix 
align 

all 

any 

append 
apply 
applymap 
as blocks 
asfreg 

as matrix 


df2 
df2 
df2 
df2 
df2 
df2 
df2 
df2 
df2 
df2 
df2 
df2 
df2 
df2 
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.boxplot 

Ç 

.clip 

.clip lower 
.clip upper 
.columns 
.combine 
.combineAdd 
.combine first 
.combineMult 

. compound 
.consolidate 
.convert_objects 
. Copy 
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df2.astype df2.corr 
df2.at df2.corrwith 
df2.at_ time df2.count 
df2.axes df2.cov 
df2.B df2.cummax 
df2.between_time df2.cummin 
df2.bfill df2.cumprod 
df2.blocks df2.cumsum 
df2.bool df2.D 


As you can see, the columns a, B, c, and D are automatically tab completed. E is there as well; the 
rest of the attributes have been truncated for brevity. 


Viewing Data 


See the Basics section 


See the top & bottom rows of the frame 


In [14]: df.head() 
Out[14]: 

A B C D 
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 
2013-01-02 1.212112 -0.173215 0.119209 -1.044236 
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 
2013-01-04 0.721555 -0.706771 -1.039575 0.271860 
2013-01-05 -0.424972 0.567020 0.276232 -1.087401 


In [15]: df.tail(3) 
Out[15]: 

A B C D 
2013-01-04 0.721555 -0.706771 -1.039575 0.271860 
2013-01-05 -0.424972 .567020 0.276232 -1.087401 


o 


2013-01-06 -0.673690 0.113648 -1.478427 0.524988 


Display the index, columns, and the underlying numpy data 


In [16]: df.index 
Out[lb]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', 
'2013-01-05', '2013-01-06'], 
dtype-'datetime64[ns]', freg-'D') 


In [17]: df.columns 
Out[17]: Index([u'A', u'B', u'C', u'D'], dtype-'object') 


In [18]: df.values 
Out[18]: 

array([[ 0.4691, -0.2829, -1.5091, -1.1356], 
[ 1.2121, -0.1732, 0.1192, -1.0442], 
[-0.8618, -2.1046, -0.4949, 1.0718], 
[ 0 


-7216, -0.7068, -1.0396, 0.2719], 
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[-0.425 , 0.567 , 0.2762, -1.0874], 
[-0.6737, 0.1136, -1.4784, 0.525 ]]) 


Describe shows a quick statistic summary of your data 


In [19]: df.describe() 


Out[19]: 

A B c D 
count 6.000000 6.000000 6.000000 6.000000 
mean .073711 -0.431125 -0.687758 -0.233103 


6 

0 

0.843157 0.922818 0.779887 0.973118 
min -0.861849 -2.104569 -1.509059 -1.135632 

0.611510 -0.600794 -1.368714 -1.076610 

0.022070 -0.228039 -0.767252 -0.386188 

0.658444 0.041933 -0.034326 0.461706 

1 


0212112 0.567020 0.276232 1.071804 


Transposing your data 


In [20]: df.T 
Out[20]: 
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06 


A 0.469112 1.212112 -0.861849 0.721555 -0.424972 -0.673690 
B -0.282863 -0.173215 -2.104569 -0.706771 0.567020 0.113648 
Cc -1.509059 0.119209 -0.494929 -1.039575 0.276232 -1.478427 
D -1.135632 -1.044236 1.071804 0.271860 -1.087401 0.524988 


Sorting by an axis 


In [21]: df.sort_index(axis=1, ascending=False) 
Out[21]: 


D C B A 
2013-01-01 -1.135632 -1.509059 -0.282863 0.469112 
2013-01-02 -1.044236 0.119209 -0.173215 1.212112 
2013-01-03 1.071804 -0.494929 -2.104569 -0.861849 
2013-01-04 0.271860 -1.039575 -0.706771 0.721555 
2013-01-05 -1.087401 0.276232 0.567020 -0.424972 
2013-01-06 0.524988 -1.478427 0.113648 -0.673690 


Sorting by values 


In [22]: df.sort_values(by-'B') 
Out[22]: 


A B C D 
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 
2013-01-04 0.721555 -0.706771 -1.039575 0.271860 
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 
2013-01-02 1.212112 -0.173215 0.119209 -1.044236 
2013-01-06 -0.673690 0.113648 -1.478427 0.524988 
2013-01-05 -0.424972 0.567020 0.276232 -1.087401 
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Selection 
Note: While standard Python / Numpy expressions for selecting and setting are intuitive and 
come in handy for interactive work, for production code, we recommend the optimized pandas 


data access methods, .at, .iat, .loc, .iloc and .ix. 


See the indexing documentation /ndexing and Selecting Data and Multilndex / Advanced Indexing 


Getting 


Selecting a single column, which yields a Series, equivalent to df.A 


In [23]: df['A'] 


Out[23]s 
2013-01-01 0.469112 
2013-01-02 1.212112 


2013-01-03 -0.861849 
2013-01-04 0.721555 
2013-01-05 -0.424972 
2013-01-06 -0.673690 
Freg: D, Name: A, dtype: float64 


Selecting via [ 1, which slices the rows. 


In [24]: df[0:3] 
Out[24]: 

A B C D 
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 
2013-01-02 1.212112 -0.173215 0.119209 -1.044236 
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 


In [25]: df['20130102':'20130104'] 
Out[25]: 


A B Ç D 
2013-01-02 1.212112 -0.173215 0.119209 -1.044236 
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 
2013-01-04 0.721555 -0.706771 -1.039575 0.271860 


Selection by Label 


See more in Selection by Label 


For getting a cross section using a label 


In [26]: df.loc[dates[0]] 
Out[26]: 
A 0.469112 
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B -0.282863 
C -1.509059 
D -1.135632 
Name: 2013-01-01 00:00:00, dtype: float64 


Selecting on a multi-axis by label 


In [27]: df.loc[:,['A','B']] 
Out[27]: 


A B 
2013-01-01 0.469112 -0.282863 
2013-01-02 1.212112 -0.173215 
2013-01-03 -0.861849 -2.104569 
2013-01-04 0.721555 -0.706771 
2013-01-05 -0.424972 0.567020 
2013-01-06 -0.673690 0.113648 


Showing label slicing, both endpoints are included 


In [28]: df.loc['20130102':'20130104',['A','B']] 
Out[28]: 


A B 
2013-01-02 1.212112 -0.173215 
2013-01-03 -0.861849 -2.104569 
2013-01-04 0.721555 -0.706771 


Reduction in the dimensions of the returned object 


In [29]: df.loc['20130102',['A','B']] 


Out[29]: 
A 1,212112 
B -0.173215 


Name: 2013-01-02 00:00:00, dtype: float64 


For getting a scalar value 


In [30]: df.loc[dates[0],'A'] 
Out[30]: 0.46911229990718628 


For getting fast access to a scalar (eguiv to the prior method) 


In [31]: df.at[dates[0],'A'] 
Out[31]: 0.46911229990718628 


Selection by Position 
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See more in Selection by Position 


Select via the position of the passed integers 


In [32]: df.iloc[3] 


Hurl 32] 
A 0.721555 
B -0.706771 


Ç -1.039575 
D 0.271860 
Name: 2013-01-04 00:00:00, dtype: float64 


By integer slices, acting similar to numpy/python 


In [33]: df.iloc[3:5,0:2] 
Out[33]: 


A B 
2013-01-04 0.721555 -0.706771 
2013-01-05 -0.424972 0.567020 


By lists of integer position locations, similar to the numpy/python style 


In [34]: df.iloc[[1,2,4],[0,2]] 
Out[34]: 


A C 
2013-01-02 1.212112 0.119209 
2013-01-03 -0.861849 -0.494929 
2013-01-05 -0.424972 0.276232 


For slicing rows explicitly 


In [35]: df.iloc[1:3,:] 
Out[35]: 


A B C D 
2013-01-02 1.212112 -0.173215 0.119209 -1.044236 
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 


For slicing columns explicitly 


In [36]: df.iloc[:,1:3] 
Out[36]: 


B C 
2013-01-01 -0.282863 -1.509059 
2013-01-02 -0.173215 0.119209 
2013-01-03 -2.104569 -0.494929 
2013-01-04 -0.706771 -1.039575 
2013-01-05 0.567020 0.276232 
2013-01-06 0.113648 -1.478427 
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For getting a value explicitly 


In [37]: df.iloc[1,1] 
Out[37]: -0.17321464905330858 


For getting fast access to a scalar (equiv to the prior method) 


In [38]: df.iat[1,1] 
Out[38]: -0.17321464905330858 


Boolean Indexing 


Using a single column’s values to select data. 


In [39]: df[df.A > 0] 
Out[39]: 


A B 
2013-01-01 0.469112 -0.282863 
2013-01-02 1.212112 -0.173215 
2013-01-04 0.721555 -0.706771 


A where operation for getting. 


In [40]: df[df > 0] 


Out[ 40]: 

A B 
2013-01-01 0.469112 NaN 
2013-01-02 1.212112 NaN 
2013-01-03 NaN NaN 
2013-01-04 0.721555 NaN 
2013-01-05 NaN 0.567020 
2013-01-06 NaN 0.113648 


Using the isin() method for filtering: 


In [41]: df2 = df.copy() 


In [42]: df2['E'] = ['one', 'one','two', 'three',' 'four','three'] 


In [43]: df2 
Out[43]: 

A B 
2013-01-01 0.469112 -0.282863 
2013-01-02 1.212112 -0.173215 
2013-01-03 -0.861849 -2.104569 
2013-01-04 0.721555 -0.706771 
2013-01-05 -0.424972 0.567020 
2013-01-06 -0.673690 0.113648 
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C D 
.509059 -1.135632 
0.119209 -1.044236 
-039575 0.271860 


-1 


0 


0 


Cc 

NaN 

. 119209 
NaN 1. 
NaN 0. 

.276232 
NaN O. 


C 
.509059 -1. 
.119209 -1. 
.494929 1. 
-039575 0. 
.276232 -1. 
.A78427 0. 


D 

NaN 
NaN 
071804 
271860 
NaN 
524988 


D 
135632 
044236 
071804 
271860 
087401 
524988 
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In [44]: df2[df2['E'].isin(['two','four'])] 
Out[ 44]: 


A B C D E 
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 two 
2013-01-05 -0.424972 0.567020 0.276232 -1.087401 four 


Setting 


Setting a new column automatically aligns the data by the indexes 


In [45]: sl = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6) 


In [46]: s1 
Out[46]: 

2013-01-02 
2013-01-03 
2013-01-04 
2013-01-05 
2013-01-06 
2013-01-07 
Freq: D, dtype: int64 


NU RUNE 


In [47]: df['F'] = s1 


Setting values by label 


In [48]: df.at[dates[0],'A'] = 0 | 


Setting values by position 

In [49]: df.iat[0,1] = 0 | 
Setting by assigning with a numpy array 

In [50]: df.loc[:,'D'] = np.array([5] * len(df)) 


The result of the prior setting operations 


In [51]: df 
Out[51]: 

A B CD F 
2013-01-01 0.000000 0.000000 -1.509059 5 NaN 
2013-01-02 1.212112 -0.173215 0.119209 5 1 
2013-01-03 -0.861849 -2.104569 -0.494929 5 2 
2013-01-04 0.721555 -0.706771 -1.039575 5 3 
2013-01-05 -0.424972 0.567020 0.276232 5 4 


http://pandas.pydata.org/pandas-docs/stable/10min.html 9/26 


1/2/2016 10 Minutes to pandas — pandas 0.17.1 documentation 
2013-01-06 -0.673690 0.113648 -1.478427 5 5 


A where operation with setting. 


In [52]: df2 = df.copy() 
In [53]: df2[df2 > 0] = -df2 


In [54]: df2 
Out[54]: 


A B C: -D F 
2013-01-01 0.000000 0.000000 -1.509059 -5 NaN 
2013-01-02 -1.212112 -0.173215 -0.119209 -5 -1 
2013-01-03 -0.861849 -2.104569 -0.494929 -5 -2 
2013-01-04 -0.721555 -0.706771 -1.039575 -5 -3 
2013-01-05 -0.424972 -0.567020 -0.276232 -5 -4 
2013-01-06 -0.673690 -0.113648 -1.478427 -5 -5 


Missing Data 


pandas primarily uses the value np.nan to represent missing data. It is by default not included in 
computations. See the Missing Data section 


Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of 
the data. 


In [55]: df1 = df.reindex(index-dates[0:4], columns=list(df.columns) + ['E']) 
In [56]: df1l.loc[dates[0]:dates[1],'E'] = 1 


In [57]: dfl 


Out[a7]s: 

A B Cc -D F E 
2013-01-01 0.000000 0.000000 -1.509059 5 NaN 1 
2013-01-02 1.212112 -0.173215 0.119209 5 1 1 
2013-01-03 -0.861849 -2.104569 -0.494929 5 2 NaN 
2013-01-04 0.721555 -0.706771 -1.039575 5 3 NaN 


To drop any rows that have missing data. 


In [58]: dfl.dropna(how-'any') 
Out[58]: 


A B C D 
2013-01-02 1.212112 -0.173215 0.119209 5 


e hj 
=. bj 


Filling missing data 


In [59]: dfl.fillna(value-5) 


http://pandas.pydata.org/pandas-docs/stable/10min.html 10/26 


1/2/2016 10 Minutes to pandas — pandas 0.17.1 documentation 
Out[59]: 


A B C 
2013-01-01 0.000000 0.000000 -1.509059 
2013-01-02 1.212112 -0.173215 0.119209 
2013-01-03 -0.861849 -2.104569 -0.494929 
2013-01-04 0.721555 -0.706771 -1.039575 


t o o a g 
WNr U hj 
UU — i] 


To get the boolean mask where values are nan 


In [60]: pd.isnull(dfl) 
Out[60]: 


A B Cc D F E 
2013-01-01 False False False False True False 
2013-01-02 False False False False False False 
2013-01-03 False False False False False True 
2013-01-04 False False False False False True 


Operations 


See the Basic section on Binary Ops 


Stats 


Operations in general exclude missing data. 


Performing a descriptive statistic 


In [61]: df.mean() 


Out[61]: 
A -0.004474 
B -0.383981 
C -0.687758 
D 5.000000 
F 3.000000 


dtype: float64 


Same operation on the other axis 


In [62]: df.mean(1) 


Out[62]: 
2013-01-01 0.872735 
2013-01-02 1.431621 
2013-01-03 0-707731 
2013-01-04 1.395042 
2013-01-05 1.883656 
2013-01-06 1.592306 


Freq: D, dtype: float64 
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Operating with objects that have different dimensionality and need alignment. In addition, pandas 
automatically broadcasts along the specified dimension. 


In [63]: 

In [64]: s 
Out[64]: 
2013-01-01 NaN 
2013-01-02 NaN 
2013-01-03 1 
2013-01-04 3 
2013-01-05 5 
2013-01-06 NaN 


Freq: D, dtype: 


In [65]: df.sub(s, 


Out[65]: 


2013-01-01 
2013-01-02 


A 
NaN 
NaN 


2013-01-03 -1.861849 
2013-01-04 -2.278445 
2013-01-05 -5.424972 
2013-01-06 


Apply 


NaN 
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float64 


NaN 
-3.104569 
-3.706771 
-4.432980 

NaN 


Applying functions to the data 


In [66]: df.apply(np. 


Out [66]: 


2013-01-01 
2013-01-02 
2013-01-03 
2013-01-04 
2013-01-05 
2013-01-06 - 


In [67]: df.apply(lambda x: x.max() 


0. 
L; 
0. 
Le 
0. 
0. 


Out[67]: 


A 
B 
C 
D 
F 


dtype: 


& OF ND 


2. 
-671590 
- 785291 
-000000 
-000000 


073961 


float64 


A 
000000 
212112 
350263 
071818 
646846 
026844 


Histogramming 


cumsum) 


B 
0.000000 
-0.173215 
-2.277784 
-2.984555 
-2.417535 
-2.303886 


axis='index' ) 


C 


D 


s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2) 


E 


NaN NaN NaN 
NaN NaN NaN 


-1.494929 
-4.039575 
-4.723768 

NaN 


C D F 
-1.509059 5 NaN 
-1.389850 10 1 
-1.884779 15 3 
-2.924354 20 6 
-2.648122 25 10 
-4.126549 30 15 

- x.min()) 


See more at Histogramming and Discretization 
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2 0 
0 -1 
NaN NaN 
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In [68]: s = pd.Series(np.random.randint(0, 


In [69]: s 
Out[69]: 
4 


WOANADU BP WNHE ns 
PEW EB ANEF NY 


dtype: int32 


In [70]: s.value_counts() 


Out[ 70]: 
4 5 
6 2 
2 2 
1 1 


dtype: int64 


String Methods 


Series is equipped with a set of string processing methods in the str attribute that make it easy to 
operate on each element of the array, as in the code snippet below. Note that pattern-matching in 
str generally uses regular expressions by default (and in some cases always uses them). See more 


at Vectorized String Methods. 


In [71]: s = pd.Series(['A', 'B', 'C', 


In [72]: s.str.lower() 
Out[72]: 

a 

b 

C 
aaba 
baca 
NaN 
caba 
dog 
cat 
dtype: object 


o- AU ws O MDO 


Merge 


Concat 
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pandas provides various facilities for easily combining together Series, DataFrame, and Panel 
objects with various kinds of set logic for the indexes and relational algebra functionality in the case 
of join / merge-type operations. 


See the Merging section 


Concatenating pandas objects together with concat(): 


In [73]: df = pd.DataFrame(np.random.randn(10, 4)) 


In [74]: df 
Out[74]: 

0 1 2 3 
0 -0.548702 1.467327 -1.015962 -0.483075 
1 1.637550 -1.217659 -0.291519 -1.745505 
2 -0.263952 0.991460 -0.919069 0.266046 
3 -0.709661 1.669052 1.037882 -1.705775 
4 -0.919854 -0.042379 1.247642 -0.009920 
5 0.290213 0.495767 0.362949 1.548106 
6 -1.131345 -0.089329 0.337863 -0.945867 
7 -0.932132 1.956030 0.017587 -0.016692 
8 -0.575247 0.254161 -1.143704 0.215897 
9 1.193555 -0.077118 -0.408530 -0.862495 


# break it into pieces 
In [75]: pieces = [df[:3], df[3:7], df[7:]] 


In [76]: pd.concat(pieces) 


Out[76]: 

0 1 2 3 
0 -0.548702 1.467327 -1.015962 -0.483075 
1 1.637550 -1.217659 -0.291519 -1.745505 
2 -0.263952 0.991460 -0.919069 0.266046 
3 -0.709661 1.669052 1.037882 -1.705775 
4 -0.919854 -0.042379 1.247642 -0.009920 
5 0.290213 0.495767 0.362949 1.548106 
6 -1.131345 -0.089329 0.337863 -0.945867 
7 -0.932132 1.956030 0.017587 -0.016692 
8 -0.575247 0.254161 -1.143704 0.215897 
9 1.193555 -0.077118 -0.408530 -0.862495 


Join 


SQL style merges. See the Database style joining 


In [77]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]}) 
In [78]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]}) 


In [79]: left 


Out[79]: 

key lval 
0 foo 1 
1 foo 2 
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In [80]: right 


Out[80]: 

key rval 
0 foo 4 
1 foo 5 


In [81]: pd.merge(left, right, on-'key') 


Out[81]: 

key lval rval 
0 foo 1 4 
1 foo 1 5 
2 foo 2 4 
3 foo 2 5 


Append 


Append rows to a dataframe. See the Appending 


In [82]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D']) 


In [83]: df 
Out[ 83]: 

A B C D 
O 1.346061 1.511763 1.627081 -0.990582 
1 -0.441652 1.211526 0.268520 0.024580 
2 -1.577585 0.396823 -0.105381 -0.532532 
3 1.453749 1.208843 -0.080952 -0.264610 
4 -0.727965 -0.589346 0.339969 -0.693205 
5 -0.339355 0.593616 0.884345 1.591431 
6 0.141809 0.220390 0.435589 0.192451 
7 -0.096701 0.803351 1.715071 -0.708758 


In [84]: s = df.iloc[3] 


In [85]: df.append(s, ignore_index=True) 
Out[85]: 

A B C D 
1.346061 1.511763 1.627081 -0.990582 
-0.441652 1.211526 0.268520 0.024580 
-1.577585 0.396823 -0.105381 -0.532532 
1.453749 1.208843 -0.080952 -0.264610 

.727965 -0.589346 0.339969 -0.693205 
-0.339355 0.593616 0.884345 1.591431 
0.141809 0.220390 0.435589 0.192451 
-0.096701 0.803351 1.715071 -0.708758 
1.453749 1.208843 -0.080952 -0.264610 


0 
1 
2 
3 
4 -0 
5 
6 
7 
8 


Grouping 
By “group by” we are referring to a process involving one or more of the following steps 


e Splitting the data into groups based on some criteria 
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e Applying a function to each group independently 
e Combining the results into a data structure 


See the Grouping 


In [86]: df - 


In [87]: df 


Out[87]: 

A B 
0 foo one 
1 bar one 
2 foo two 
3 bar three 
4 foo two 
5 bar two 
6 foo one 
7 foo three 


section 


pd.DataFrame(('A' 


C 
-1.202872 
-1.814470 

1.018601 
-0.595447 
1.395433 
-0.392670 
0.007207 
1.928123 


D 
-0.055224 
2.395985 
1.552825 
0.166599 
0.047609 
-0.136473 
-0.561757 
-1.623033 


[' £00", 
"£00"; 
['one', 


two , 


/ 


/ 


d 


' 


'foo', 
"foo; 
'two', 
'one', 


np.random.randn(8), 
np.random.randn(8))) 


Grouping and then applying a function sum to the resulting groups. 


In [88]: df.groupby('A').sum() 


Out[88]: 
C 


A 
bar -2.802588 
foo 3.146492 


Grouping by multiple columns forms a hierarchical index, which we then apply the function. 


D 


2.42611 
-0.63958 


In [89]: df.groupby(['A','B']).sum() 


Out[a9]: 

A B 

bar one -1. 
three -0. 
two -0. 

foo one -1. 
three 1. 
two Dis 


Reshaping 


C 
814470 2. 
595447 0. 
392670 -0. 
195665 -0. 
928123 -1. 
414034 1. 


D 


395985 
166599 
136473 
616981 
623033 
600434 


See the sections on Hierarchical Indexing and Reshaping. 
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Stack 


In [90]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 


—— "£00"; "foo"; 'gux', 'gux']; 
cr ['one', 'two', 'one', 'two', 
Er 'one', 'two', 'one', 'two']])) 


In [91]: index - pd.MultiIndex.from tuples(tuples, names-['first', 'second']) 
In [92]: df = pd.DataFrame(np.random.randn(8, 2), index-index, columns-['A', 'B'] 
In [93]: df2 - df[:4] 


In [94]: df2 


Out[94]: 
A B 
first second 
bar one 0.029399 -0.542108 
two 0.282696 -0.087302 
baz one -1.575170 1.771208 
two 0.816482 1.100230 


The stack() method “compresses” a level in the DataFrame’s columns. 


In [95]: stacked = df2.stack() 


In [96]: stacked 


Out[96]: 
first second 
bar one A 0.029399 
B -0.542108 
two A 0.282696 
B -0.087302 
baz one A -1.575170 
B 1.771208 
two A 0.816482 
B 1.100230 


dtype: float64 


With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of 
stack() iS unstack(), which by default unstacks the last level: 


In [97]: stacked.unstack() 


Out[97]: 
A B 
first second 
bar one 0.029399 -0.542108 
two 0.282696 -0.087302 
baz one -1.575170 1.771208 
two 0.816482 1.100230 


In [98]: stacked.unstack(1) 
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Out[98]: 
second 
first 
bar 


baz 


U pb uU 


In [99]: 
Out[59]: 
first 
second 
one 


two 


A 
B 
A 
B 


one two 
-029399 0.282696 
.542108 -0.087302 
-575170 0.816482 
771208 1.100230 


stacked.unstack(0) 


bar baz 
.029399 -1.575170 
.542108 1.771208 
-282696 0.816482 
-087302 1.100230 


Pivot Tables 


See the section on Pivot Tables. 


In [100]: df = pd.DataFrame({'A' 
TERE 'B 
—— uC 
Sea aod "D 
Fr 'E 

In [101]: df 

Out[101]: 

A B C D 

0 one A foo 1.418757 -0 

1 one B foo -1.879024 1 

2 two C foo 0.536826 -0 

3 three A bar 1.006160 O 

4 one B bar -0.029716 0 

5 one C bar -1.146178 -0 

6 two A foo 0.100900 -1 

7 three B foo -1.035018 1 

8 one C foo 0.314665 -0 

9 one A bar -0.773723 1 

10 two B bar -1.170653 0 

11 three C bar 0.648740 1 
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: ['one', 
' : ['A', 
' ['foo', 


E 


. 179666 
- 291836 
-009614 
- 392149 
. 264599 


. 057409 
.425638 
. 024098 
. 106062 
.824375 
595974 


.167115 


'B', 


'one', 'two', 'three'] * 3, 
Ce * 4, 
'foo', 'foo', 'bar', 'bar', 'bar 


We can produce pivot tables from this data very easily: 


In [102]: pd.pivot_table(df, values-'D', index-['A', 


Out[102]: 

C 

A B 

one A -0. 
B -0. 
c -1. 


three A 1. 


bar foo 
773723 1.418757 
029716 -1.879024 
146178 0.314665 
006160 NaN 
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B NaN -1.03 
C 0.648740 

two A NaN 0.10 
B -1.170653 
Cc NaN 0.53 


Time Series 


pandas has simple, powerful, and efficient functionality for performing resampling operations during 
frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely 


5018 
NaN 
0900 
NaN 
6826 
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common in, but not limited to, financial applications. See the Time Series section 


In [103]: 


In [104]: 


In [105]: ts.resample('5Min', how='sum' ) 


Out[105]: 
2012-01-01 


rng 


ts 


= pd.Series(np.random.randint(0, 


25 


Freq: 5T, dtype: 


pd.date range('l/1/2012', periods-100, freq='S') 


083 
int32 


Time zone representation 


In [106]: 
In [107]: 


In [108]: 

Out[108]: 

2012-03-06 
2012-03-07 
2012-03-08 
2012-03-09 
2012-03-10 


rng 


ts 


ts 


500, 


len(rng)), index=rng) 


pd.date range('3/6/2012 00:00', periods-5, freg-'D') 


= pd.Series(np.random.randn(len(rng)), rng) 


0. 
0. 
-0. 
0. 
-2. 


Freq: D, dtype: 


In [109]: 


In [110]: 
Out[110]: 

2012-03-06 
2012-03-07 
2012-03-08 
2012-03-09 
2012-03-10 


ts_utc 


ts_utc 


00: 
00: 
00: 
00: 
00: 


Freg: D, dtype: 


00: 
00: 
00: 
00: 
00: 


464000 
227371 
496922 
306389 
290613 
float64 


= ts.tz localize('UTC') 


00-00: 
00-00: 
00-00: 
004-00: 
00-00: 
float64 


Convert to another time zone 


In [111]: ts_utc.tz convert('US/Eastern') 


Out[111]: 


00 
00 
00 
00 
00 
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2012-03-05 19:00:00-05:00 0.464000 
2012-03-06 19:00:00-05:00 0.227371 
2012-03-07 19:00:00-05:00 -0.496922 
2012-03-08 19:00:00-05:00 0.306389 
2012-03-09 19:00:00-05:00 -2.290613 
Freq: D, dtype: float64 


Converting between time span representations 


In [112]: rng = pd.date_range('1/1/2012', periods=5, freq='M') 
In [113]: ts = pd.Series(np.random.randn(len(rng)), index=rng) 


In [114]: ts 


Out[114]: 
2012-01-31 -1.134623 
2012-02-29 -1.561819 


2012-03-31 -0.260838 
2012-04-30 0.281957 
2012-05-31 1.523962 
Freq: M, dtype: float64 


In [115]: ps = ts.to period() 


In [116]: ps 


Out[116]: 
2012-01 -1.134623 
2012-02 -1.561819 


2012-03 -0.260838 
2012-04 0.281957 
2012-05 1.523962 
Freq: M, dtype: float64 


In [117]: ps.to timestamp() 


Out[117]: 
2012-01-01 —1.134623 
2012-02-01 -1.561819 


2012-03-01 -0.260838 
2012-04-01 0.281957 
2012-05-01 1.523962 
Freq: MS, dtype: float64 


Converting between period and timestamp enables some convenient arithmetic functions to be 
used. In the following example, we convert a guarterly freguency with year ending in November to 
9am of the end of the month following the guarter end: 


In [118]: prng - pd.period range('199001', '200004', freg-'O-NOV') 

In [119]: ts = pd.Series(np.random.randn(len(prng)), prng) 

In [120]: ts.index = (prng.asfreq('M', ‘e') + 1).asfreq('H', 's') + 9 
In [121]: ts.head() 


Out[121]: 
1990-03-01 09:00 -0.902937 
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1990-06-01 09:00 0.068159 
1990-09-01 09:00 -0.057873 
1990-12-01 09:00 -0.368204 
1991-03-01 09:00 -1.144073 
Freq: H, dtype: float64 


Categoricals 


Since version 0.15, pandas can include categorical data in a DataFrame. For full docs, see the 
categorical introduction and the API documentation. 


In [122]: df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw grade":['a', 'b', 'b', 'a', 


Convert the raw grades to a categorical data type. 


In [123]: df["grade"] = df["raw_grade"].astype("category" ) 


In [124]: df["grade"] 
Out[124]: 


0 
1 
2 
3 
4 
5 


Name: grade, dtype: category 
Categories (3, object): [a, b, e] 


ooo oom 


Rename the categories to more meaningful names (assigning to Series.cat.categories is 
inplace!) 


In [125]: df["grade"].cat.categories = ["very good", "good", "very bad"] 


Reorder the categories and simultaneously add the missing categories (methods under Series 
.cat return a new Series per default). 


In [126]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "mediu 


In [127]: df["grade"] 


Out[127]: 

0 very good 

1 good 

2 good 

3 very good 

4 very good 

5 very bad 

Name: grade, dtype: category 

Categories (5, object): [very bad, bad, medium, good, very good] 
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Sorting is per order in the categories, not lexical order. 


In [128]: df.sort_ values(by-"grade") 


Out[128]: 
id raw grade grade 
6 very bad 
good 
good 


very good 
very good 
very good 


GO ON FE UI 
OBE WN 
Pop ooroo 


Grouping by a categorical column shows also empty categories. 


In [129]: df.groupby("grade").size() 
Ont[129]s 
grade 

very bad 
bad 

medium 

good 

very good 
dtype: int64 


oe ooonm— 


Plotting 


Plotting docs. 


In [130]: ts = pd.Series(np.random.randn(1000), index-pd.date range('1/1/2000', p 


In [131]: ts = ts.cumsum() 


In [132]: ts.plot() 
Out[132]: <matplotlib.axes. subplots.AxesSubplot at 0xae3696ac> 
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10 -I i ' i ‘ ' i ' i i ' i | ' i ' i i ' ' i ' i ' 1 i ' i ' ' ' ' ' _ 


—10 - a 
—20 - _ 
—30 - m 
—40 - - 
—50 - a 
—60 - _ 


-70a y ——— Y —y—y——y—y—y——— y —y———— r 
Jan Jul Jan Jul Jan Jul 
2000 2001 2002 


On DataFrame, plot () is a convenience to plot all of the columns with labels: 


In [133]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, 
Gas columns-['A', 'B', 'C', 'D']) 


In [134]: df df.cumsum( ) 


In [135]: plt.figure(); df.plot(); plt.legend(loc-'best') 
Out[135]: <matplotlib.legend.Legend at 0xab53b26c> 
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60 - | I 
— A 
— B 
40 -ME— 
— D 
20 - 
0- 
AU 
A 
—20 - 
~ Iy Ng 
U 

—40 - 

—60 = —_ — — 
Jan Jul Jan Jul Jan Jul 
2000 2001 2002 

Getting Data In/Out 

CSV 

Writing to a csv file 

In [136]: df.to csv('foo.csv') 

Reading from a csv file 

In [137]: pd.read csv('foo.csv') 

Out[137]: 

Unnamed: 0 A B Cc D 

0 2000-01-01 0.266457 -0.399641 -0.219582 1.186860 

1 2000-01-02 -1.170732 -0.345873 1.653061 -0.282953 

2 2000-01-03 -1.734933 0.530468 2.060811 -0.515536 

3 2000-01-04 -1.555121 1.452620 0.239859 -1.156896 
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4 
5 
6 
993 
994 
995 
996 
997 


998 
999 


2000-01-05 
2000-01-06 
2000-01-07 


2002-09-20 
2002-09-21 
2002-09-22 
2002-09-23 
2002-09-24 
2002-09-25 
2002-09-26 


o 


-10 


-10. 
-8. 
-9. 
-9. 

-10. 

-11. 
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578117 0. 
.478344 0. 
.235339 -0. 
.628548 -9. 
390377 -8. 
985362 -8. 
558560 -8. 
902058 -9. 
216020 -9. 
856774 -10. 


511371 
449933 
091757 
153563 
727491 
485624 
781216 
340490 


480682 
671012 


0. 
-0. 
-1. 


-7 


103552 
741620 
543861 


- 883146 
-6. 
-4. 
-4. 
-4. 
-3. 
-3. 


399645 
669462 
499815 
386639 
933802 
216025 


-2 


28 


-428202 
-1. 
-1. 


962409 
084753 


- 313940 
30. 
31. 
30. 
30. 
29. 
29; 


914107 
367740 
518439 
105593 
758560 
369368 


[1000 rows x 5 columns] 


HDF5 


Reading and writing to HDFStores 


Writing to a HDF5 Store 


In [138]: df.to hdf('foo.h5','df') 


Reading from a HDF5 Store 


In [139]: pd.read hdf('foo.h5','df') 


Out[139]: 


2000-01-01 
2000-01-02 
2000-01-03 
2000-01-04 
2000-01-05 
2000-01-06 
2000-01-07 


2002-09-20 
2002-09-21 
2002-09-22 
2002-09-23 
2002-09-24 
2002-09-25 
2002-09-26 


[1000 rows 


Excel 


A 
.266457 -0. 
.170732 -0. 
- 734933 0. 
.555121 1. 
.578117 0. 
-478344 0. 
.235339 -0. 
.628548 -9. 
.390377 -8. 
.985362 -8. 
.558560 -8. 
.902058 -9. 
-216020 -9. 
-856774 -10. 
columns ] 


Reading and writing to MS Excel 


Writing to an excel file 
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B 
399641 
345873 
530468 
452620 
511371 
449933 
091757 
153563 
727491 
485624 
781216 
340490 
480682 
671012 


C 
.219582 Le 
-653061 -0 
-060811 -0. 
-239859 -l. 
-103552 -2. 
-741620 -l. 
-543861 -1. 
-883146 28. 
-399645 30. 
-669462 31. 
-499815 30. 
-386639 30. 
.933802 29. 
.216025 29. 


D 
186860 


.282953 


515536 
156896 
428202 
962409 
084753 
313940 
914107 
367740 
518439 
105593 
758560 
369368 
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In [140]: df.to excel('foo.xlsx', sheet name-'Sheetl') 
Reading from an excel file 


In [141]: pd.read excel('foo.xlsx', 'Sheetl', index col-None, na values-r['NA']) 
Out[141]: 
A B C D 
2000-01-01 0.266457 -0.399641 -0.219582 1.186860 
2000-01-02 -1.170732 -0.345873 1.653061 -0.282953 
2000-01-03 -1.734933 0.530468 2.060811 -0.515536 
2000-01-04 -1.555121 1.452620 0.239859 -1.156896 
2000-01-05 0.578117 0.511371 0.103552 -2.428202 
2000-01-06 0.478344 0.449933 -0.741620 -1.962409 
2000-01-07 1.235339 -0.091757 -1.543861 -1.084753 


2002-09-20 -10.628548 -9.153563 -7.883146 28.313940 
2002-09-21 -10.390377 -8.727491 -6.399645 30.914107 
2002-09-22 -8.985362 -8.485624 -4.669462 31.367740 
2002-09-23 -9.558560 -8.781216 -4.499815 30.518439 
2002-09-24 -9.902058 -9.340490 -4.386639 30.105593 
2002-09-25 -10.216020 -9.480682 -3.933802 29.758560 
2002-09-26 -11.856774 -10.671012 -3.216025 29.369368 


[1000 rows x 4 columns] 


Gotchas 


If you are trying an operation and you see an exception like: 


>>> if pd.Series([False, True, False]): 
print("I was true") 
Traceback 


ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.a 


See Comparisons for an explanation and what to do. 


See Gotchas as well. 


http://pandas.pydata.org/pandas-docs/stable/10min.html 26/26 


